fix: guard handle_read against closed socket race (#614)#902
Conversation
Add is_closed check at the top of LibevConnection.handle_read() to prevent recv() on an already-closed socket. Root cause: close() is called from any thread (main/app) and does _socket.close() synchronously, but the libev reactor thread may already be inside handle_read() for that connection (dispatched before _loop_will_run batches the watcher stop). Without this guard, recv() on the closed FD raises OSError [Errno 9] Bad file descriptor. This is the minimal fix for issue scylladb#614. The race is inherent in the threading model (single reactor thread shared across all sessions, close callable from any thread). The watcher stop is asynchronous (batched in _loop_will_run), but socket.close() is synchronous — this guard bridges the gap. Reproducer: scylladb/scylla-dtest#7079
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
I ran the CI job here: https://argus.scylladb.com/tests/scylla-cluster-tests/59720349-680a-44ed-bd9f-109ddf28657f/results So, I re-triggered it with different nemesis seed: scylla-staging/valerii/vp-longevity-100gb-4h-oci-test#38 |
Minimal fix for issue #614 — Bad file descriptor
2-line change. Adds
if self.is_closed: returnat the top ofLibevConnection.handle_read().Root Cause
Thread race between
close()andhandle_read():close()callsself._socket.close()synchronously (from any thread)_loop_will_run(), next loop iteration)handle_read()for that connection before the watcher is stopped,recv()hits a closed FD → EBADFThe Fix
Why this is safe
is_closedis set underself.lockinclose()before_socket.close()is calledis_closedis True, the socket is either already closed or about to be — no useful work can be doneReproducer
100% reproduction rate: https://github.com/scylladb/scylla-dtest/pull/7079
Note on existing fix branches
fix/factory-close-race-614— fixes factory returning dead connections (different race)fix/asyncio-close-race-614— fixes AsyncioConnection race (different reactor)Neither adds the
handle_readguard for LibevConnection. This PR is complementary.How to test with SCT/Hydra
@vponomaryov — could you try building hydra with this branch and check if it fixes the EBADF reports in longevity runs? This is the minimal possible fix (2 lines). If it works, we can merge this independently of the other #614 fix branches.
Closes #614